Abstract

An existing approach for dealing with massive data sets is to stream over the input in few passes and 
perform computations with sublinear resources. This method does not work for truly massive data where 
even making a single pass over the data with a processor is prohibitive. Successful log processing systems in 
practice such as Google's MapReduce and Apache's Hadoop use multiple machines. They efficiently perform 
a certain class of highly distributable computations defined by local computations that can be applied in any 
order to the input. 

Motivated by the success of these systems, we introduce a simple algorithmic model for massive, un- 
ordered, distributed (mud) computation. We initiate the study of understanding its computational complexity. 
Our main result is a positive one: any unordered function that can be computed by a streaming algorithm can 
also be computed with a mud algorithm, with comparable space and communication complexity. We extend 
this result to some useful classes of approximate and randomized streaming algorithms. We also give negative 
results, using communication complexity arguments to prove that extensions to private randomness, promise 
problems and indeterminate functions are impossible. 

We believe that the line of research we introduce in this paper has the potential for tremendous impact. The 
distributed systems that motivate our work successfully process data at an unprecedented scale, distributed over 
hundreds or even thousands of machines, and perform hundreds of such analyses each day. The mud model
(and its generalizations) inspires a set of complexity-theoretic questions that lie at their heart.
1 Introduction



We now have truly massive data sets, many of which are generated by logging events in physical systems. For 
example, data sources such as IP traffic logs, web page repositories, search query logs, retail and financial trans- 
actions, and other sources consist of billions of items per day, and are accumulated over many days. Internet 
search companies such as Google, Yahoo!, and MSN, financial companies such as Bloomberg, retail businesses 
such as Amazon and WalMart, and other companies use this type of data. 

In theory, we have formulated the data stream model to study algorithms that process such truly massive data 
sets. Data stream models fO", T] make one pass over the logs, read and process each item on the stream rapidly 
and use local storage of size sublinear — typically, polylogarithmic — in the input. There is now a large body of 
algorithms and lower bounds in data stream models (see [12] for a survey).

Yet, streaming models alone are not sufficient. For example, logs of Internet activity are so large that no single 
processor can make even a single pass over the data in a reasonable amount of time. The solution in practice has 
been to deploy more machines, distribute the data over these machines and process different pieces of data in 
parallel. For example, Google's MapReduce [8] and Apache's Hadoop [5] are successful large scale distributed
platforms that can process many terabytes of data at a time, distributed over hundreds or even thousands of
machines, and process hundreds of such analyses each day. A reason for their success is that logs-processing
algorithms written for these platforms have a simple form that lets the platform process the input in an arbitrary
order, and combine partial computations using whatever communication pattern is convenient.

In this paper, we introduce a simple model for these algorithms, which we refer to as "mud" (Massive, 
Unordered, Distributed) algorithms. This computational model raises several interesting complexity questions 
which we address. Almost all the work in streaming, including the seminal [9, 2] and its extensions [1], has
been motivated by massive data computations, making one or more linear passes over the data. The algorithms
developed in this area have in many cases found applications to distributed data processing, e.g., motivated by 
sensor networks. Our work is the first to address this distributed model specifically, and attempt to understand its 
power and limitations. 

1.1 Mud algorithms 

Distributed platforms like MapReduce and Hadoop are engines for executing arbitrary tasks with a certain simple 
structure over many machines. These platforms can solve many different kinds of problems, and in particular 
are used extensively for analyzing logs. Logs analysis algorithms written for these platforms consist of three 
functions: (1) a local function to take a single input data item and output a message, (2) an aggregation function 
to combine pairs of messages, and in some cases (3) a final post-processing step. The distributed platform 
assumes that the local function can be applied to the input data items independently in parallel, and that the 
aggregation function can be applied to pairs of messages in any order. This allows the platform to synchronize 
the machines very coarsely (assigning them to work on whatever chunk of data becomes available), and avoids 
the need for machines to share vast amounts of data (thereby eliminating communication bottlenecks) — yielding 
a highly distributed, robust execution in practice. 

Example. Consider this simple logs analysis algorithm to compute the sum of squares of a large set of numbers:^1

    x = input_record;
    x_squared = x * x;
    aggregator: table sum;
    emit aggregator <- x_squared;

This program is written as if it only runs on a single input record, since it is interpreted as the local function
in MapReduce. Instantiating the aggregator object as a "table" of type "sum" signals MapReduce to use
summation as its aggregation function. "Emitting" x_squared into the aggregator defines the message output
by the local function. When MapReduce executes this program, the final output is the result of aggregating all the
messages (in this case the sum of the squares of the numbers). This can then be post-processed in some way (e.g.,
taking the square root, for computing the L2 norm). Large numbers of algorithms of this form are used daily for
processing logs [15].

^1 This is expressed in the Sawzall [15] language, a language used at Google for logs processing that runs on the MapReduce platform.
The example is a complete Sawzall program minus some type declarations.
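The three-part structure of the example (local function, aggregation function, optional post-processing) can be mirrored in a few lines of Python. This is only an illustrative sketch: the names `local`, `aggregate` and `post_process` are hypothetical and are not part of Sawzall or MapReduce, and a sequential `reduce` stands in for the platform's distributed aggregation.

```python
import math
from functools import reduce

def local(x):
    """Local function: map one input record to a message (here, its square)."""
    return x * x

def aggregate(m1, m2):
    """Aggregation function: combine two messages (here, by summation)."""
    return m1 + m2

def post_process(total):
    """Optional post-processing: the square root gives the L2 norm."""
    return math.sqrt(total)

records = [3.0, 4.0, 12.0]
messages = [local(x) for x in records]          # could run independently per record
sum_of_squares = reduce(aggregate, messages)    # a platform may combine in any order
l2_norm = post_process(sum_of_squares)
print(sum_of_squares, l2_norm)                  # 169.0 13.0
```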

Definition of a mud algorithm. We now formally define a mud algorithm as a triple $m = (\Phi, \oplus, \eta)$. The local
function $\Phi : \Sigma \to Q$ maps an input item to a message, the aggregator $\oplus : Q \times Q \to Q$ maps two messages
to a single message, and the post-processing operator $\eta : Q \to \Sigma$ produces the final output. The output can
depend on the order in which $\oplus$ is applied. Formally, let $T$ be an arbitrary binary tree circuit with $n$ leaves.
We use $m_T(\mathbf{x})$ to denote the $q \in Q$ that results from applying $\oplus$ to the sequence $\Phi(x_1), \ldots, \Phi(x_n)$ along the
topology of $T$ with an arbitrary permutation of these inputs as its leaves. The overall output of the mud algorithm
is then $\eta(m_T(\mathbf{x}))$, which is a function $\Sigma^n \to \Sigma$. Notice that $T$ is not part of the algorithm definition, but rather,
the algorithm designer needs to make sure that $\eta(m_T(\mathbf{x}))$ is independent of $T$.^2 We say that a mud algorithm
computes a function $f$ if $\eta(m_T(\cdot)) = f$ for all trees $T$.
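To make the role of the tree $T$ concrete, the following sketch evaluates a mud algorithm on a randomly chosen tree and leaf permutation. The helper name `evaluate_on_random_tree` is hypothetical; shuffling the messages and then repeatedly merging a random adjacent pair is one simple way (an assumption of this sketch, not a construction from the text) to realize an arbitrary binary tree over permuted leaves.

```python
import random

def evaluate_on_random_tree(phi, oplus, eta, xs, rng):
    """Apply phi to every item, then merge messages pairwise in a random order
    and shape; each run corresponds to one tree T and one leaf permutation."""
    messages = [phi(x) for x in xs]
    rng.shuffle(messages)                       # arbitrary permutation of the leaves
    while len(messages) > 1:
        i = rng.randrange(len(messages) - 1)    # pick an adjacent pair to merge
        messages[i:i + 2] = [oplus(messages[i], messages[i + 1])]
    return eta(messages[0])

# Sum of squares as a mud algorithm: the output should not depend on the tree.
sum_of_squares = (lambda x: x * x, lambda a, b: a + b, lambda q: q)
outputs = {evaluate_on_random_tree(*sum_of_squares, [1, 2, 3, 4, 5], random.Random(seed))
           for seed in range(50)}
print(outputs)  # {55}
```

A check like this can only give evidence of tree-independence on the sampled trees; it does not prove it.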

We give two examples. On the left is a mud algorithm to compute the total span (max — min) of a set of 
integers. On the right is a mud algorithm to compute a uniform random sample of the unique items in a set (i.e, 
items that appear at least once) by using an approximate minwise hash function h (see [^.T] for details): 
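Total span (left):

$$\Phi(x) = (x, x)$$
$$\oplus((a_1, b_1), (a_2, b_2)) = (\min(a_1, a_2), \max(b_1, b_2))$$
$$\eta((a, b)) = b - a$$

Uniform sample of unique items (right):

$$\Phi(x) = (x, h(x), 1)$$
$$\oplus((a_1, h(a_1), c_1), (a_2, h(a_2), c_2)) = \begin{cases} (a_i, h(a_i), c_i) & \text{if } h(a_i) < h(a_j) \\ (a_1, h(a_1), c_1 + c_2) & \text{otherwise} \end{cases}$$
$$\eta((a, b, c)) = a \text{ if } c = 1$$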
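Both examples translate directly into code. The sketch below follows the formulas literally; the hash `h` is a stand-in (an assumption, chosen only because multiplication by an odd constant modulo $2^{32}$ is injective on 32-bit non-negative integers), and returning `None` when $c \neq 1$ is also an assumption, since the text does not say what $\eta$ outputs in that case.

```python
import random
from functools import reduce

# Total span (max - min).
def span_phi(x):
    return (x, x)

def span_oplus(m1, m2):
    (a1, b1), (a2, b2) = m1, m2
    return (min(a1, a2), max(b1, b2))

def span_eta(m):
    a, b = m
    return b - a

# Min-hash based sample; h is a hypothetical stand-in for the hash in the text.
def h(x):
    return (x * 2654435761) % 2**32

def sample_phi(x):
    return (x, h(x), 1)

def sample_oplus(m1, m2):
    if m1[1] < m2[1]:
        return m1
    if m2[1] < m1[1]:
        return m2
    return (m1[0], m1[1], m1[2] + m2[2])   # equal hashes: add the counts

def sample_eta(m):
    a, _, c = m
    return a if c == 1 else None           # the c != 1 case is unspecified in the text

def run(phi, oplus, eta, xs):
    """Evaluate on a left-deep tree over xs in the given order."""
    return eta(reduce(oplus, [phi(x) for x in xs]))

xs = [7, 3, 9, 3, 12]
shuffled = xs[:]
random.Random(0).shuffle(shuffled)
print(run(span_phi, span_oplus, span_eta, xs),
      run(span_phi, span_oplus, span_eta, shuffled))   # 9 9
```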



The communication complexity of a mud algorithm is $\log |Q|$, the number of bits needed to represent a
"message" from one component to the next. We consider the {space, time} complexity of a mud algorithm to be
the maximum {space, time} complexity of its component functions.^3

1.2 How complex are mud algorithms? 

We wish to understand the complexity of mud algorithms. Recall that a mud algorithm to compute a function
must work for all computation trees over $\oplus$ operations; now consider the following tree:
$\oplus(\oplus(\ldots \oplus(\oplus(q, \Phi(x_1)), \Phi(x_2)), \ldots, \Phi(x_{k-1})), \Phi(x_k))$. This sequential application of $\oplus$ corresponds to the conventional
streaming model (see, e.g., the survey [12]).

Formally, a streaming algorithm is given by $s = (\sigma, \eta)$, where $\sigma : Q \times \Sigma \to Q$ is an operator applied
repeatedly to the input stream, and $\eta : Q \to \Sigma$ converts the final state to the output. The notation $s^q(\mathbf{x})$ denotes
the state of the streaming algorithm after starting at state $q$, and operating on the sequence $\mathbf{x} = x_1, \ldots, x_k$ in
that order, that is, $s^q(\mathbf{x}) = \sigma(\sigma(\ldots \sigma(\sigma(q, x_1), x_2), \ldots, x_{k-1}), x_k)$. On input $\mathbf{x} \in \Sigma^n$, the streaming algorithm
computes $\eta(s^0(\mathbf{x}))$, where $0$ is the starting state. As in mud, we define the communication complexity to be
$\log |Q|$ (which is typically polylogarithmic), and the {space, time} complexity as the maximum {space, time}
complexity of $\sigma$ and $\eta$.

Streaming algorithms can compute whatever mud algorithms can compute: given a mud algorithm $m = (\Phi, \oplus, \eta)$,
there is a streaming algorithm $s = (\sigma, \eta)$ of the same complexity with the same output, by setting
$\sigma(q, x) = \oplus(q, \Phi(x))$. The central question then is, can a mud algorithm compute whatever a streaming
algorithm computes? It is immediate that there are streaming computations that cannot be simulated by mud
algorithms. For example, consider a streaming algorithm that counts the number of occurrences of the first
element in the stream: no mud algorithm can accomplish this since it cannot determine the first element in the input.
Therefore, in order to be fair, since mud algorithms work on unordered data, we restrict our attention to functions
$\Sigma^n \to \Sigma$ that are symmetric (order-invariant) and address this central question.

^2 This is implied if $\oplus$ is associative and commutative; however, this is not necessary.

^3 This is the only thing that is under the control of the algorithm designer; indeed the actual execution time (which we do not formally
define here) will be a function of the number of machines available, runtime behavior of the platform and these local complexities.
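The easy direction, $\sigma(q, x) = \oplus(q, \Phi(x))$, and the counterexample of counting occurrences of the first element can both be sketched briefly. The helper names are hypothetical, and the sketch assumes the starting state is a neutral element for $\oplus$ (here `0` for addition), which the text leaves implicit.

```python
def streaming_from_mud(phi, oplus):
    """Build the streaming update sigma(q, x) = oplus(q, phi(x))."""
    def sigma(q, x):
        return oplus(q, phi(x))
    return sigma

def run_stream(sigma, q0, xs):
    """Apply sigma sequentially, i.e. evaluate the left-deep tree."""
    q = q0
    for x in xs:
        q = sigma(q, x)
    return q

# Sum of squares again; 0 is assumed to be the neutral starting state.
sigma = streaming_from_mud(lambda x: x * x, lambda a, b: a + b)
print(run_stream(sigma, 0, [1, 2, 3]))   # 14

def count_first(xs):
    """A streaming computation that is NOT symmetric: count occurrences of
    the first element. The answer changes when the input is reordered."""
    first, count = None, 0
    for x in xs:
        if first is None:
            first, count = x, 1
        elif x == first:
            count += 1
    return count

print(count_first([1, 2, 1]), count_first([2, 1, 1]))   # 2 1
```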

1.3 Our Results 

We present the following positive and negative results comparing mud to streaming algorithms, restricted to 
symmetric functions: 

• We show that any deterministic streaming algorithm that computes a symmetric function $\Sigma^n \to \Sigma$ can
be simulated by a mud algorithm with the same communication complexity, and the square of its space 
complexity. This result generalizes to certain approximation algorithms, and randomized algorithms with 
public randomness. 

• We show that the claim above does not extend to richer symmetric function classes, such as when the
function comes with a promise that the domain is guaranteed to satisfy some property (e.g., finding the 
diameter of a graph known to be connected), or the function is indeterminate, i.e., one of many possible 
outputs is allowed for "successful computation." (e.g., finding a number in the highest 10% of a set of 
numbers.) Likewise, with private randomness, the claim above is no longer true. 

The simulation in our result takes time $\Omega(2^{\mathrm{polylog}(n)})$ from the use of Savitch's theorem. So while not a
practical algorithm, our result implies that if we wanted to separate mud algorithms from streaming algorithms 
for symmetric functions, we need techniques other than communication complexity-based arguments. 

Also, when we consider symmetric problems that have been addressed in the streaming literature, they seem 
to always yield mud algorithms (e.g., all streaming algorithms that allow insertions and deletions in the stream, or 
are based on various sketches lH can be seen as mud algorithms). In fact, we are not aware of a specific problem 
that has a streaming solution, but no mud algorithm with comparable complexity (up to polylog factors in space 
and per-item time).^4 Our result here provides some insight into this intuitive state of our knowledge and presents
rich function classes for which distributed streaming (mud) is provably as powerful as sequential streaming. 

1.4 Techniques 

One of the core arguments used to prove our positive results comes from an observation in communication
complexity. Consider evaluating a symmetric function $f(\mathbf{x})$ given two disjoint portions of the input $\mathbf{x} = \mathbf{x}_A \cdot \mathbf{x}_B$,
in each of the two following models. In the one-way communication model (OCM), David knows portion $\mathbf{x}_A$,
and sends a single message $D(\mathbf{x}_A)$ to Emily who knows portion $\mathbf{x}_B$; she then outputs $E(D(\mathbf{x}_A), \mathbf{x}_B) = f(\mathbf{x}_A \cdot \mathbf{x}_B)$.
In the simultaneous communication model (SCM) both Alice and Bob send a message $A(\mathbf{x}_A)$ and $B(\mathbf{x}_B)$
respectively, simultaneously to Carol who must compute $f(\mathbf{x}_A \cdot \mathbf{x}_B)$. Clearly, OCM protocols can simulate SCM
protocols.^5 At the core, our result relies on observing that SCM protocols can simulate OCMs too, for symmetric
functions $f$, by guessing the inputs that result in the particular message received by a party.

To prove our main result, that mud can simulate streaming, we apply the above argument many times over
an arbitrary tree topology of $\oplus$ computations, using Savitch's theorem to guess input sequences that match input
states of streaming computations. This is delicate because we can use the symmetry of $f$ only at the root of the
tree; simply iterating the argument at every node in the computation tree independently would yield weaker results
that would force the function to be symmetric on subsets of the input, which is not assumed by our theorem.

^4 There are specific algorithms (such as one of the algorithms for estimating $F_2$ in [2]) that are sequential and not mud algorithms,
but there are other alternative mud algorithms with similar bounds for the problems they solve.

^5 The SCM here is identical to the simultaneous message model [3] or oblivious communication model [16] studied previously if there
are $k = 2$ players. For $k > 2$, our mud model is not the same as in previous work [3, 16]. The results in [3, 16] as they apply to us are not
directly relevant since they only show examples of functions that separate SCM and OCM significantly.

To prove our negative results, we also use communication limitations — of the intermediate SCM. We define 
order-independent problems easily solved by a single-pass streaming algorithm and then formulate instances 
that require a polynomial amount of communication in the SCM. The order-independent problems we create are 
variants of parity and index problems that are traditionally used in communication complexity lower bounds. 

2 Main Result 

In this section we give our main result, that any symmetric function computed by a streaming algorithm can also 
be computed by a mud algorithm. 

2.1 Preliminaries 

As is standard, we fix the space and communication to be polylog(n).^6

Definition 1. A symmetric function $f : \Sigma^n \to \Sigma$ is in the class MUD if there exists a polylog(n)-communication,
polylog(n)-space mud algorithm $m = (\Phi, \oplus, \eta)$ such that for all $\mathbf{x} \in \Sigma^n$, and all computation trees $T$, we have
$\eta(m_T(\mathbf{x})) = f(\mathbf{x})$.

Definition 2. A symmetric function $f : \Sigma^n \to \Sigma$ is in the class SS if there exists a polylog(n)-communication,
polylog(n)-space streaming algorithm $s = (\sigma, \eta)$ such that for all $\mathbf{x} \in \Sigma^n$ we have $\eta(s^0(\mathbf{x})) = f(\mathbf{x})$.

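Note that for subsequences $\mathbf{x}_\alpha$ and $\mathbf{x}_\beta$, we get $s^q(\mathbf{x}_\alpha \cdot \mathbf{x}_\beta) = s^{s^q(\mathbf{x}_\alpha)}(\mathbf{x}_\beta)$. We can apply this identity to obtain
the following simple lemma.

Lemma 1. Let $\mathbf{x}_\alpha$ and $\mathbf{x}'_\alpha$ be two strings and $q$ a state such that $s^q(\mathbf{x}_\alpha) = s^q(\mathbf{x}'_\alpha)$. Then for any string $\mathbf{x}_\beta$, we
have $s^q(\mathbf{x}_\alpha \cdot \mathbf{x}_\beta) = s^q(\mathbf{x}'_\alpha \cdot \mathbf{x}_\beta)$.

Proof. We have $s^q(\mathbf{x}_\alpha \cdot \mathbf{x}_\beta) = s^{s^q(\mathbf{x}_\alpha)}(\mathbf{x}_\beta) = s^{s^q(\mathbf{x}'_\alpha)}(\mathbf{x}_\beta) = s^q(\mathbf{x}'_\alpha \cdot \mathbf{x}_\beta)$. □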

Also, note that for some $f \in SS$, because $f$ is symmetric, the output $\eta(s^0(\mathbf{x}))$ of a streaming algorithm
$s = (\sigma, \eta)$ that computes it must be invariant over all permutations of the input; i.e.:

$$\forall \mathbf{x} \in \Sigma^n, \text{ permutations } \pi : \quad \eta(s^0(\mathbf{x})) = f(\mathbf{x}) = f(\pi(\mathbf{x})) = \eta(s^0(\pi(\mathbf{x}))) \qquad (1)$$

This fact about the output of $s$ does not necessarily mean that the state of $s$ is permutation-invariant; indeed,
consider a streaming algorithm to compute the sum of $n$ numbers that for some reason remembers the first
element it sees (which is ultimately ignored by the function $\eta$). In this case the state of $s$ depends on the order of
the input, but the final output does not.
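The sum-with-remembered-first-element example is small enough to write out. In this sketch the state is a pair (first element, running sum); the encoding of the starting state as `(None, 0)` is an assumption made for illustration. The last line also checks the composition identity $s^q(\mathbf{x}_\alpha \cdot \mathbf{x}_\beta) = s^{s^q(\mathbf{x}_\alpha)}(\mathbf{x}_\beta)$ on one instance.

```python
Q0 = (None, 0)   # starting state: no first element yet, sum 0

def sigma(q, x):
    """State is (first element seen, running sum); eta never looks at the first element."""
    first, total = q
    return (x if first is None else first, total + x)

def eta(q):
    return q[1]

def run(q, xs):
    """s^q(xs): the state reached from q after reading xs in order."""
    for x in xs:
        q = sigma(q, x)
    return q

a, b = run(Q0, [1, 2, 3]), run(Q0, [3, 2, 1])
print(a, b)              # (1, 6) (3, 6): the states depend on the order
print(eta(a), eta(b))    # 6 6: the output does not
print(run(Q0, [1, 2] + [3]) == run(run(Q0, [1, 2]), [3]))   # True
```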

2.2 Statement of the result 

We argued that streaming algorithms can simulate mud algorithms by setting $\sigma(q, x) = \oplus(q, \Phi(x))$, which
implies MUD $\subseteq$ SS. The main result in this paper is:

Theorem 1. For any symmetric function $f : \Sigma^n \to \Sigma$ computed by a $g(n)$-space, $c(n)$-communication streaming
algorithm $(\sigma, \eta)$, with $g(n) = \Omega(\log n)$ and $c(n) = \Omega(\log n)$, there exists a $O(c(n))$-communication, $O(g^2(n))$-space
mud algorithm $(\Phi, \oplus, \eta)$ that also computes $f$.

This immediately gives: MUD = SS.

^6 The results in this paper extend to other sub-linear (say $\sqrt{n}$) space, and communication bounds in a natural way.
2.3 Proof of Theorem 1



We prove Theorem 1 by simulating an arbitrary streaming algorithm with a mud algorithm. The main challenges
of the simulation are in 

(i) achieving polylog communication complexity in the messages sent between © operations, 

(ii) achieving polylog space complexity for computations needed to support the protocol above, and 

(iii) extending the methods above to work for an arbitrary computation tree. 

We tackle these three challenges in order with the full proof given later. 

Communication complexity. Consider the final application of $\oplus$ (at the root of the tree $T$) in a mud computation.
The inputs to this function are two messages $q_A, q_B \in Q$ that are computed independently from a partition
$\mathbf{x}_A, \mathbf{x}_B$ of the input. The output is a state $q_C$ that will lead directly to the overall output $\eta(q_C)$. This is similar to
the task Carol faces in SCM: the input is split arbitrarily between Alice and Bob, who independently process
their input (using unbounded computational resources), but then must transmit only a single symbol from $Q$ to
Carol; Carol then performs some final processing (again, unbounded), and outputs an answer in $\Sigma$. We show:

Theorem 2. Every function $f \in SS$ can be computed in the SCM with communication polylog(n).

Proof. Let $s = (\sigma, \eta)$ be a streaming algorithm that computes $f$. We assume (wlog) that the streaming algorithm
$s$ maintains a counter in its state $q \in Q$ indicating the number of input elements it has seen so far.

We compute $f$ in the SCM as follows. Let $\mathbf{x}_A$ and $\mathbf{x}_B$ be the partitions of the input sequence $\mathbf{x}$ sent to Alice
and Bob. Alice simply runs the streaming algorithm on her input sequence to produce the state $q_A = s^0(\mathbf{x}_A)$,
and sends this to Carol. Similarly, Bob sends $q_B = s^0(\mathbf{x}_B)$ to Carol. Carol receives the states $q_A$ and $q_B$, which
contain the sizes $n_A$ and $n_B$ of the input sequences $\mathbf{x}_A$ and $\mathbf{x}_B$. She then finds sequences $\mathbf{x}'_A$ and $\mathbf{x}'_B$ of length $n_A$
and $n_B$ such that $q_A = s^0(\mathbf{x}'_A)$ and $q_B = s^0(\mathbf{x}'_B)$. (Such sequences must exist since $\mathbf{x}_A$ and $\mathbf{x}_B$ are candidates.)
Carol then outputs $\eta(s^0(\mathbf{x}'_A \cdot \mathbf{x}'_B))$. To complete the proof:

$$\begin{aligned}
\eta(s^0(\mathbf{x}'_A \cdot \mathbf{x}'_B)) &= \eta(s^0(\mathbf{x}_A \cdot \mathbf{x}'_B)) && \text{(by Lemma 1)} \\
&= \eta(s^0(\mathbf{x}'_B \cdot \mathbf{x}_A)) && \text{(by (1))} \\
&= \eta(s^0(\mathbf{x}_B \cdot \mathbf{x}_A)) && \text{(by Lemma 1)} \\
&= \eta(s^0(\mathbf{x}_A \cdot \mathbf{x}_B)) && \text{(by (1))} \\
&= f(\mathbf{x}_A \cdot \mathbf{x}_B) && \text{(by the correctness of } s\text{)} \\
&= f(\mathbf{x}).
\end{aligned}$$

□

Space complexity. The simulation above uses space linear in the input. We now give a more space-efficient
implementation of Carol's computation. More precisely, if the streaming algorithm uses space $g(n)$, we show
how Carol can use only space $O(g^2(n))$; this space-efficient simulation will eventually be the algorithm used by
$\oplus$ in our mud algorithm.

Lemma 2. Let $s = (\sigma, \eta)$ be a $g(n)$-space streaming algorithm with $g(n) = \Omega(\log n)$. Then, there is a $O(g^2(n))$-space
algorithm that, given states $q_A, q_B \in Q$ and lengths $n_A, n_B \in [n]$, outputs a state $q_C = s^0(\mathbf{x}_C)$, where
$\mathbf{x}_C = \mathbf{x}'_A \cdot \mathbf{x}'_B$ for some $\mathbf{x}'_A, \mathbf{x}'_B$ of lengths $n_A, n_B$ such that $s^0(\mathbf{x}'_A) = q_A$ and $s^0(\mathbf{x}'_B) = q_B$. (If such a $q_C$ exists.)

Proof. Note that there may be many $\mathbf{x}'_A, \mathbf{x}'_B$ that satisfy the conditions of the lemma, and thus there are many
valid answers for $q_C$. We only require an arbitrary such value. However, if we only have $O(g^2(n))$ space, and $g^2(n)$
is sublinear, we cannot even write down $\mathbf{x}'_A$ and $\mathbf{x}'_B$. Thus we need to be careful about how we find $q_C$.

Consider a non-deterministic algorithm for computing a valid $q_C$. First, guess the symbols of $\mathbf{x}'_A$ one at a time,
simulating the streaming algorithm $s^0(\mathbf{x}'_A)$ on the guess. If after $n_A$ guessed symbols we have $s^0(\mathbf{x}'_A) \neq q_A$,
reject this branch. Then, guess the symbols of $\mathbf{x}'_B$, simulating (in parallel) $s^0(\mathbf{x}'_B)$ and $s^{q_A}(\mathbf{x}'_B)$. If after $n_B$ steps
we have $s^0(\mathbf{x}'_B) \neq q_B$, reject this branch; otherwise, output $q_C = s^{q_A}(\mathbf{x}'_B)$. This is a non-deterministic, $O(g(n))$-space
algorithm for computing a valid $q_C$. By Savitch's theorem [17], it follows that $q_C$ can be computed by a
deterministic, $O(g^2(n))$-space algorithm. (The application of Savitch's theorem in this context amounts to a dynamic
program for finding a state $q_C$ such that the streaming algorithm can get from state $q_A$ to $q_C$ and from state $0$ to
$q_B$ using the same input string of length $n_B$.) □

The running time of this algorithm is super-polynomial from the use of Savitch's theorem, which dominates 
the running time in our simulation. 
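Before any space optimization, Carol's procedure from the proof of Theorem 2 can be sketched by exhaustive search. This is a toy illustration under stated assumptions: a three-symbol alphabet, a streaming algorithm that sums its input while also remembering the first element, and a state that carries the input counter as its third component. The names `find_preimage` and `carol` are hypothetical.

```python
from itertools import product

ALPHABET = (0, 1, 2)      # toy alphabet (assumption)
Q0 = (None, 0, 0)         # (first element, sum, counter)

def sigma(q, x):
    first, total, count = q
    return (x if first is None else first, total + x, count + 1)

def eta(q):
    return q[1]

def run(q, xs):
    for x in xs:
        q = sigma(q, x)
    return q

def find_preimage(q_target):
    """Exhaustively search for any sequence of the recorded length that
    drives the streaming algorithm from Q0 to q_target."""
    for cand in product(ALPHABET, repeat=q_target[2]):
        if run(Q0, cand) == q_target:
            return cand
    raise ValueError("no sequence reaches this state")

def carol(q_a, q_b):
    """Guess x'_A and x'_B from the two messages and run s on x'_A . x'_B."""
    return eta(run(Q0, find_preimage(q_a) + find_preimage(q_b)))

x_a, x_b = (2, 0, 1), (1, 1, 2, 0)
print(carol(run(Q0, x_a), run(Q0, x_b)))   # 7
```

The guessed sequences need not equal Alice's and Bob's real inputs; the derivation in the proof is what guarantees the output still matches. The search is exponential in the input length, which is the cost the space-efficient version is designed to avoid in terms of space.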



Finishing the proof for arbitrary computation trees. To prove Theorem 1 we will simulate an arbitrary
streaming algorithm with a mud algorithm, setting $\oplus$ to Carol's procedure, as implemented in Lemma 2. The
remaining challenge is to show that the computation is successful on an arbitrary computation tree; we do this by
relying on the symmetry of $f$ and the correctness of Carol's procedure.
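A Savitch-style version of Carol's procedure, used as the aggregator, might look like the sketch below. It is a toy under explicit assumptions: a binary alphabet, a parity-computing streaming algorithm with six states, messages that carry the state together with the sequence length, and a recursion over guessed midpoints of the pair automaton. None of these names come from the paper. The recursion never stores a guessed string, only pairs of states, at the price of a long running time.

```python
from itertools import product

ALPHABET = (0, 1)
STATES = [(f, p) for f in (None, 0, 1) for p in (0, 1)]   # (first symbol, parity)
Q0 = (None, 0)

def sigma(q, x):
    first, parity = q
    return (x if first is None else first, (parity + x) % 2)

def eta(q):
    return q[1]

def reach(u, v, k):
    """True if a single string of length k drives the state pair u to the pair v.
    Recurse on a guessed midpoint pair instead of storing the string."""
    if k == 0:
        return u == v
    if k == 1:
        return any((sigma(u[0], x), sigma(u[1], x)) == v for x in ALPHABET)
    half = k // 2
    return any(reach(u, m, half) and reach(m, v, k - half)
               for m in product(STATES, repeat=2))

def oplus(msg_a, msg_b):
    """Messages are (state, length). Returns (q_c, n_a + n_b), or None if no valid q_c exists."""
    (q_a, n_a), (q_b, n_b) = msg_a, msg_b
    if not reach((Q0, Q0), (q_a, q_a), n_a):       # q_a must be reachable in n_a steps
        return None
    for q_c in STATES:
        if reach((Q0, q_a), (q_b, q_c), n_b):      # same string: 0 -> q_b and q_a -> q_c
            return (q_c, n_a + n_b)
    return None

def phi(x):
    return (sigma(Q0, x), 1)

l0, l1, l2, l3 = (phi(x) for x in [1, 0, 1, 1])
balanced = oplus(oplus(l0, l1), oplus(l2, l3))
left_deep = oplus(oplus(oplus(l0, l1), l2), l3)
print(eta(balanced[0]), eta(left_deep[0]))   # 1 1
```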

Proof of Theorem 1. Let $f \in SS$ and let $s = (\sigma, \eta)$ be a streaming algorithm that computes $f$. We assume wlog
that $s$ includes in its state $q$ the number of inputs it has seen so far. We define a mud algorithm $m = (\Phi, \oplus, \eta)$
where $\Phi(x) = \sigma(0, x)$, and using the same $\eta$ function as $s$ uses. The function $\oplus$, given $q_A, q_B \in Q$ and input
sizes $n_A, n_B$, outputs some $q_C = q_A \oplus q_B = s^0(\mathbf{x}_C)$ as in Lemma 2. To show the correctness of $m$, we need
to show that $\eta(m_T(\mathbf{x})) = f(\mathbf{x})$ for all computation trees $T$ and all $\mathbf{x} \in \Sigma^n$. For the remainder of the proof,
let $T$ and $\mathbf{x}^* = (x^*_1, \ldots, x^*_n)$ be an arbitrary tree and input sequence, respectively. The tree $T$ is a binary
in-tree with $n$ leaves. Each node $v$ in the tree outputs a state $q_v \in Q$, including the leaves, which output a state
$q_i = \Phi(x^*_i) = \sigma(0, x^*_i) = s^0(x^*_i)$. The root $r$ outputs $q_r$, and so we need to prove that $\eta(q_r) = f(\mathbf{x}^*)$.

The proof is inductive. We associate with each node $v$ a "guess sequence," $\mathbf{x}_v$, which for internal nodes
is the sequence $\mathbf{x}_C$ as in Lemma 2 and for leaves $i$ is the single symbol $x^*_i$. Note that for all nodes $v$, we
have $q_v = s^0(\mathbf{x}_v)$, and the length of $\mathbf{x}_v$ is equal to the number of leaves in the subtree rooted at $v$. Define a
frontier of tree nodes to be a set of nodes such that each leaf has exactly one ancestor in the set. (A node is
considered an ancestor of itself.) The root itself is a frontier, as is the complete set of leaves. We say a frontier
$V = \{v_1, \ldots, v_k\}$ is correct if the streaming algorithm on the data associated with the frontier is correct, that
is, $\eta(s^0(\mathbf{x}_{v_1} \cdot \mathbf{x}_{v_2} \cdots \mathbf{x}_{v_k})) = f(\mathbf{x}^*)$. Since the guess sequences of a frontier always have total length $n$, the
correctness of a frontier set is invariant of how the set is ordered (by (1)). Note that the frontier set consisting of
all leaves is immediately correct by the correctness of $s$. The correctness of our mud algorithm would follow from
the correctness of the root as a frontier set, since at the root, correctness implies $\eta(s^0(\mathbf{x}_r)) = \eta(q_r) = f(\mathbf{x}^*)$.

To prove that the root is a correct frontier, it suffices to define an operation that takes an arbitrary correct frontier
$V$ with at least two nodes, and produces another correct frontier $V'$ with one fewer node. We can then apply
this operation repeatedly until the unique frontier of size one (the root) is obtained. Let $V$ be an arbitrary correct
frontier with at least two nodes. We claim that $V$ must contain two children $a, b$ of the same node $c$.^7 To obtain
$V'$ we replace $a$ and $b$ by their parent $c$. Clearly $V'$ is a frontier, and so it remains to show that $V'$ is correct. We
can write $V$ as $\{a, b, v_1, \ldots, v_k\}$, and so $V' = \{c, v_1, \ldots, v_k\}$. For ease of notation, let $\hat{\mathbf{x}} = \mathbf{x}_{v_1} \cdot \mathbf{x}_{v_2} \cdots \mathbf{x}_{v_k}$.

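The remainder of the argument follows the logic in the proof of Theorem 2. Observe that we now have to be
careful that the guess for a string is the same length as the original string; this property is guaranteed in Lemma 2.
Here $\mathbf{x}'_a$ and $\mathbf{x}'_b$ denote the sequences guessed by $\oplus$ at node $c$, so that $\mathbf{x}_c = \mathbf{x}'_a \cdot \mathbf{x}'_b$, and $\hat{\mathbf{x}}$ denotes the concatenation
of the guess sequences of the remaining frontier nodes.

$$\begin{aligned}
f(\mathbf{x}^*) &= \eta(s^0(\mathbf{x}_a \cdot \mathbf{x}_b \cdot \hat{\mathbf{x}})) && \text{(by the correctness of } V\text{)} \\
&= \eta(s^0(\mathbf{x}'_a \cdot \mathbf{x}_b \cdot \hat{\mathbf{x}})) && \text{(by Lemma 1)} \\
&= \eta(s^0(\mathbf{x}_b \cdot \mathbf{x}'_a \cdot \hat{\mathbf{x}})) && \text{(by (1))} \\
&= \eta(s^0(\mathbf{x}'_b \cdot \mathbf{x}'_a \cdot \hat{\mathbf{x}})) && \text{(by Lemma 1)} \\
&= \eta(s^0(\mathbf{x}'_a \cdot \mathbf{x}'_b \cdot \hat{\mathbf{x}})) && \text{(by (1))} \\
&= \eta(s^0(\mathbf{x}_c \cdot \hat{\mathbf{x}})) && \text{(by Lemma 2)}
\end{aligned}$$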

^7 Proof: consider one of the nodes $a \in V$ furthest from the root. Suppose its sibling $b$ is not in $V$. Then any leaf in the tree rooted at $b$
must have its ancestor in $V$ further from $r$ than $a$; otherwise a leaf in the tree rooted at $a$ would have two ancestors in $V$. This contradicts
$a$ being furthest from the root.
2.4 Extensions to randomized and approximation algorithms

We have proved that any deterministic streaming computation of a symmetric function can be simulated by a 
mud algorithm. However most nontrivial streaming algorithms in the literature rely on randomness, and/or are 
approximations. Still, our results have interesting implications as described below. 

Many streaming algorithms for approximating a function $f$ work by computing some other function $g$ exactly
over the stream, and from that obtaining an approximation $\hat{f}$ to $f$, in postprocessing. For example, sketch-based
streaming algorithms maintain counters computed by inner products $c_j = \langle \mathbf{x}, \mathbf{v}_j \rangle$ where $\mathbf{x}$ is the input vector and
each $\mathbf{v}_j$ is some vector chosen by the algorithm. From the set of $c_j$'s, the algorithms compute $\hat{f}$. As long as $g$ is a
symmetric function (such as the counters), our simulation results apply to $g$ and hence to the approximation of $f$:
such streaming algorithms, approximate though they are, have equivalent mud algorithms. This is a strengthening
of Theorem 1 to approximations.
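For sketch-based algorithms the mud structure is visible directly, because the counters $c_j = \langle \mathbf{x}, \mathbf{v}_j \rangle$ are linear in the input. The sketch below assumes, for illustration only, that the input is a list of item identifiers (so $\mathbf{x}$ is their frequency vector), that the $\mathbf{v}_j$ are pseudo-random sign vectors drawn from a shared seed with Python's `random` module, and that the post-processing averages squared counters in the spirit of second-moment estimators. The function names are hypothetical, and no accuracy guarantee is claimed for this toy choice of vectors.

```python
import random
from functools import reduce

def make_sign_vectors(num_counters, universe_size, seed):
    """Shared (public) randomness: every component derives the same vectors from the seed."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(universe_size)]
            for _ in range(num_counters)]

def make_sketch_mud(vectors):
    def phi(item):
        # One occurrence of `item` contributes v_j[item] to counter j.
        return tuple(v[item] for v in vectors)
    def oplus(m1, m2):
        # Counters are linear, so messages combine by coordinate-wise addition.
        return tuple(a + b for a, b in zip(m1, m2))
    def eta(m):
        # Post-processing: average of squared counters.
        return sum(c * c for c in m) / len(m)
    return phi, oplus, eta

vectors = make_sign_vectors(num_counters=4, universe_size=6, seed=1)
phi, oplus, eta = make_sketch_mud(vectors)
items = [0, 2, 2, 5, 1, 2]
counters = reduce(oplus, [phi(i) for i in items])
estimate = eta(counters)
```

Because addition is associative and commutative, the counters are the same for every tree and every ordering of the items.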

Our discussion above can be formalized easily for deterministic algorithms. There are however some details in 
formalizing it for randomized algorithms. Informally, we focus on the class of randomized streaming algorithms 
that are order-independent for particular choices of random bits, such as all the randomized sketch-based [2, 10]
streaming algorithms. Formally, 

Definition 3. A symmetric function $f : \Sigma^n \to \Sigma$ is in the class rSS if there exists a set of polylog(n)-communication,
polylog(n)-space streaming algorithms $\{s^R = (\sigma^R, \eta^R)\}_{R \in \{0,1\}^k}$, $k = \mathrm{polylog}(n)$, such that for all $\mathbf{x} \in \Sigma^n$,

1. $\Pr_{R \sim \{0,1\}^k}[\eta^R(s^R(\mathbf{x})) = f(\mathbf{x})] \ge 2/3$, and

2. for all $R \in \{0,1\}^k$, and permutations $\pi$, $\eta^R(s^R(\mathbf{x})) = \eta^R(s^R(\pi(\mathbf{x})))$.

We define the randomized variant of MUD analogously.

Definition 4. A symmetric function $f : \Sigma^n \to \Sigma$ is in rMUD if there exists a set of polylog(n)-communication,
polylog(n)-space mud algorithms $\{m^R = (\Phi^R, \oplus^R, \eta^R)\}_{R \in \{0,1\}^k}$, $k = \mathrm{polylog}(n)$, such that for all $\mathbf{x} \in \Sigma^n$,

1. for all computation trees $T$, we have $\Pr_{R \sim \{0,1\}^k}[\eta^R(m^R_T(\mathbf{x})) = f(\mathbf{x})] \ge 2/3$, and

2. for all $R \in \{0,1\}^k$, permutations $\pi$, and pairs of trees $T, T'$, we have $\eta^R(m^R_T(\mathbf{x})) = \eta^R(m^R_{T'}(\pi(\mathbf{x})))$.

The second property in each of the definitions ensures that each particular algorithm ($s^R$ or $m^R$) computes a
deterministic symmetric function after $R$ is chosen. This makes it straightforward to extend Theorem 1 to show
rMUD = rSS.

3 Negative Results 

In the previous section, we demonstrated conditions under which mud computations can simulate streaming 
computations. We saw, explicitly or implicitly, that we have mud algorithms for a function 

(i) that is total, i.e., defined on all inputs,

(ii) that has one unique output value, and, 

(iii) that has a streaming algorithm that, if randomized, uses public randomness. 

In this section, we show that each one of these conditions is necessary: if we drop any of them, we can 
separate mud from streaming. Our separations are based on communication complexity lower bounds in the 
SCM model, which suffices (see the "communication complexity" paragraph in Section 2.3).
3.1 Private Randomness

In the definition of rMUD, we assumed that the same R was given to each component; i.e, public randomness. 
We show that this is necessary in order to simulate a randomized streaming algorithm, even for the case of total 
functions. Formally, we prove: 

Theorem 3. There exists a symmetric total function $f \in rSS$, such that there is no randomized mud algorithm
for computing $f$ using only private randomness.

In order to prove Theorem 3, we demonstrate a total function $f$ that is computable by a single-pass, randomized
polylog(n)-space streaming algorithm, but any SCM protocol for $f$ with private randomness has communication
complexity $\Omega(\sqrt{n})$. Our proof uses a reduction from the string-equality problem to a problem that we call
SetParity. In the latter problem, we are given a collection of records $S = (i_1, b_1), (i_2, b_2), \ldots, (i_n, b_n)$, where
for each $j \in [n]$, we have $i_j \in \{0, \ldots, n-1\}$, and $b_j \in \{0, 1\}$. We are asked to compute the following function,
which is clearly a total function under a natural encoding of the input:




$$f(S) = \begin{cases} 1 & \text{if } \forall t \in \{0, \ldots, n-1\},\ \sum_{j : i_j = t} b_j \bmod 2 = 0 \\ 0 & \text{otherwise} \end{cases}$$



We give a randomized streaming algorithm that computes $f$ using the $\epsilon$-biased generators of [13]. Next, in order
to lower-bound the communication complexity of a SCM protocol for SetParity, we use the fact that any SCM
protocol for string-equality has complexity $\Omega(\sqrt{n})$ [14, 4]. Due to lack of space, the remainder of the proof of
Theorem 3 is given in the appendix.

3.2 Promise Functions 

In many cases we would like to compute functions on an input with a particular structure (e.g., a connected graph). 
Motivated by this, we define the classes pMUD and pSS capturing respectively mud and streaming algorithms 
for symmetric functions that are not necessarily total (they are defined only on inputs that satisfy a property that 
is promised). 

Definition 5. Let $A \subseteq \Sigma^n$. A symmetric function $f : A \to \Sigma$ is in the class pMUD if there exists a polylog(n)-communication,
polylog(n)-space mud algorithm $m = (\Phi, \oplus, \eta)$ such that for all $\mathbf{x} \in A$, and computation trees
$T$, we have $\eta(m_T(\mathbf{x})) = f(\mathbf{x})$.

Definition 6. Let $A \subseteq \Sigma^n$. A symmetric function $f : A \to \Sigma$ is in the class pSS if there exists a polylog(n)-communication,
polylog(n)-space streaming algorithm $s = (\sigma, \eta)$ such that for all $\mathbf{x} \in A$ we have $\eta(s^0(\mathbf{x})) = f(\mathbf{x})$.

Theorem 4. pMUD $\subsetneq$ pSS.

To prove Theorem 4 we introduce a promise problem that we call SymmetricIndex, and we show that it
is in pSS but not in pMUD. Intuitively, we want to define a problem in which the input consists of two sets
of records. In the first set, we are given an $n$-bit string $x_1, \ldots, x_n$ and a query index $p$. In the second set, we are
given an $n$-bit string $y_1, \ldots, y_n$ and a query index $q$. We want to compute either $x_q$ or $y_p$, and we are guaranteed
that $x_q = y_p$. Formally, the alphabet of the input is $\Sigma = \{a, b\} \times [n] \times \{0, 1\} \times [n]$. An input $S \in \Sigma^{2n}$ is some
arbitrary permutation of a sequence of the form

$$S = (a, 1, x_1, p), (a, 2, x_2, p), \ldots, (a, n, x_n, p), (b, 1, y_1, q), (b, 2, y_2, q), \ldots, (b, n, y_n, q).$$

Additionally, the set $S$ satisfies the promise that $x_q = y_p$. Our task is to compute the function $f(S) = x_q$. In
order to prove Theorem 4 we give a deterministic polylog(n)-space streaming algorithm for SymmetricIndex,
and we show that any deterministic SCM protocol for the same problem has communication complexity $\Omega(n)$.
Due to lack of space, the proof appears in the Appendix.
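
To make the input format concrete, the following sketch builds a SymmetricIndex input from two bit strings and evaluates $f(S) = x_q$ by direct lookup. The helper names `make_instance` and `symmetric_index`, the tuple encoding of records, and the use of Python lists for $x$ and $y$ are assumptions for illustration only; indices are 1-based as in the text.

```python
import random

def make_instance(x, y, p, q, rng=None):
    """Build a SymmetricIndex input from bit lists x, y and 1-based indices p, q."""
    n = len(x)
    if len(y) != n or not (1 <= p <= n and 1 <= q <= n):
        raise ValueError("need |x| = |y| = n and p, q in [n]")
    if x[q - 1] != y[p - 1]:
        raise ValueError("promise x_q = y_p is violated")
    records = [("a", i + 1, x[i], p) for i in range(n)]
    records += [("b", i + 1, y[i], q) for i in range(n)]
    (rng or random).shuffle(records)   # the input is an arbitrary permutation
    return records

def symmetric_index(records):
    """Reference evaluation of f(S) = x_q by scanning all 2n records."""
    q = next(r[3] for r in records if r[0] == "b")
    return next(r[2] for r in records if r[0] == "a" and r[1] == q)
```

For example, with `x = [0, 1, 1]`, `y = [1, 1, 0]`, `p = 2` and `q = 3`, the promise holds because $x_3 = y_2 = 1$.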
3.3 Indeterminate Functions



In some applications, the function we wish to compute may have more than one "correct" answer. We define the 
classes iMUD and iSS to capture the computation of "indeterminate" functions. 

Definition 7. A total symmetric function $f : \Sigma^n \to 2^{\Sigma}$ is in the class iMUD if there exists a
polylog(n)-communication, polylog(n)-space mud algorithm $m = (\Phi, \oplus, \eta)$ such that for all $x \in \Sigma^n$ and computation
trees $\mathcal{T}$, we have $\eta(m_{\mathcal{T}}(x)) \in f(x)$.

Definition 8. A total symmetric function $f : \Sigma^n \to 2^{\Sigma}$ is in the class iSS if there exists a polylog(n)-communication,
polylog(n)-space streaming algorithm $s = (\sigma, \eta)$ such that for all $x \in \Sigma^n$ we have $s^0(x) \in f(x)$.

Consider a promise function $f : A \to \Sigma$ such that $f \in$ pMUD. We can define a total indeterminate function
$f' : \Sigma^n \to 2^{\Sigma}$ such that for each $x \in A$, $f'(x) = \{f(x)\}$, and for each $x \notin A$, $f'(x) = \Sigma$. That is, for any input that
satisfies the promise of $f$, the two functions agree, while for all other inputs, any output is acceptable for $f'$.
Clearly, a streaming or mud algorithm for $f'$ is also a streaming or mud algorithm for $f$, respectively. Therefore,
Theorem 4 implies the following result.

Theorem 5. iMUD $\subsetneq$ iSS.
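
The extension of a promise function $f$ to a total indeterminate function $f'$ described above can be sketched as follows. The helper name `indeterminate_extension`, the predicate `in_promise` that decides membership in $A$, and the representation of $f'(x)$ as a Python set are all assumptions for illustration.

```python
def indeterminate_extension(f, in_promise, alphabet):
    """Return f' with f'(x) = {f(x)} on the promise set A and f'(x) = Sigma elsewhere."""
    def f_prime(x):
        if in_promise(x):
            return {f(x)}          # only the promised answer is acceptable
        return set(alphabet)       # outside the promise, any output is acceptable
    return f_prime
```

Any algorithm whose output always lies in `f_prime(x)` necessarily outputs `f(x)` on every input that satisfies the promise, which is the observation used to derive Theorem 5.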

4 Concluding Remarks 

Unlike conventional streaming systems that make passes over ordered data with a single processor, modern log
processing systems like Google's MapReduce [8] and Apache's Hadoop rely on massive, unordered,
distributed (mud) computations to do data analysis in practice, and get speedups. Motivated by that, we have
introduced the model of mud algorithms. Our main result is that any symmetric function that can be computed by 
a streaming algorithm can be computed by a mud algorithm as well with comparable space and communication 
resources, showing the equivalence of the two classes. At the heart of the proof is a nondeterministic simulation 
of a streaming algorithm that guesses the stream, and an application of Savitch's theorem to be space-efficient. 
This result formalizes some of the intuition that has been used in designing streaming algorithms in the past 
decade. This result has certain natural extensions to approximate and randomized computations, and we show 
that other natural extensions to richer classes of symmetric functions are impossible. 

We think the generalization of mud algorithms to reflect the full power of these modern log processing systems
is likely to be a very exciting area of future research. In one generalization, a "multi-key" mud algorithm computes
a function on $\Sigma^n$ in a single round, where each symbol in the output is the result of a "single-key" mud
algorithm (as we have defined it in this paper). Because generalized mud models already work in practice at
massive scale, algorithmic and complexity-theoretic insights will have tremendous impact.

There are other technical problems that are open and of interest. In particular, can one obtain a more
time-efficient simulation for Theorem 1? Also, D. Sivakumar asked if there are natural problems for which this
simulation provides an interesting algorithm [17].
A Appendix for Section 3



A.1 Proof of Theorem 3

A randomized streaming algorithm for computing $f$ works as follows. We pick an $\epsilon$-biased family of $n$ binary
random variables $X_0, \ldots, X_{n-1}$, for some $\epsilon < 1/2$. Such a family has the property that for any nonempty $S \subseteq [n]$,

$$\Pr\Big[\sum_{i \in S} X_i \bmod 2 = 1\Big] > 1/4.$$

Moreover, this family can be constructed using $O(\log n)$ random bits, such that the value of each $X_i$ can be
computed in time $\log^{O(1)} n$ [13]. We can thus compute in a streaming fashion the bit
$B = b_1 \cdot X_{i_1} + b_2 \cdot X_{i_2} + \cdots + b_n \cdot X_{i_n} \bmod 2$. Observe that if $f(S) = 1$, then $\Pr[B = 1] = 0$. On the other hand, if $f(S) = 0$, then let

$$A = \Big\{ t \in \{0, \ldots, n-1\} \;\Big|\; \sum_{j : i_j = t} b_j \bmod 2 = 1 \Big\}.$$

We have

$$\Pr[B = 1] = \Pr\Big[\sum_{i \in A} X_i \bmod 2 = 1\Big] > 1/4.$$

Thus, by repeating the above experiment in parallel $O(\log n)$ times, we obtain a randomized streaming
algorithm for SetParity that succeeds with high probability.
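
The one-sided test above can be sketched in a few lines. This is an illustration under a loud assumption: it draws fully independent uniform bits from a seeded `random.Random` as a stand-in for the $\epsilon$-biased family, so it stores $n$ bits per trial and does not reproduce the $O(\log n)$ seed length of the construction in [13]. The function name `set_parity_streaming` and its parameters `trials` and `seed` are hypothetical.

```python
import random

def set_parity_streaming(records, n, trials=40, seed=0):
    """One-pass randomized test for SetParity with one-sided error.

    Assumption: X[t][i] are independent uniform bits standing in for an
    epsilon-biased family; a real generator would compute X_i on demand
    from an O(log n)-bit seed instead of storing a table.
    """
    rng = random.Random(seed)
    X = [[rng.randrange(2) for _ in range(n)] for _ in range(trials)]
    B = [0] * trials                      # one bit of state per parallel trial
    for i, b in records:                  # single pass over the records
        if b:
            for t in range(trials):
                B[t] ^= X[t][i]           # B_t += b_j * X_{i_j} (mod 2)
    return 0 if any(B) else 1             # some B_t = 1 certifies f(S) = 0
```

If $f(S) = 1$ every $B_t$ stays 0, so the sketch never errs on such inputs; if $f(S) = 0$ it errs only when all trials miss, which happens with probability exponentially small in `trials`.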

It remains to show that there is no SCM protocol for SetParity with communication complexity $o(\sqrt{n})$.
We will use a reduction from the string-equality problem [14, 4]. Alice gets a string $x_1, \ldots, x_n \in \{0, 1\}^n$, and Bob
gets a string $y_1, \ldots, y_n \in \{0, 1\}^n$. They independently compute the sets of records $S_A = \{(1, x_1), \ldots, (n, x_n)\}$
and $S_B = \{(1, y_1), \ldots, (n, y_n)\}$. It is easy to see that $f(S_A \cup S_B) = 1$ iff the answer to the string-equality
problem is YES. Thus, any protocol with private randomness for $f$ has communication complexity $\Omega(\sqrt{n})$.
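
The reduction can be checked on small inputs with the sketch below. The function names are hypothetical, and keys are numbered from 0 rather than 1 so that they fall in $\{0, \ldots, n-1\}$; `set_parity` is a direct, non-streaming evaluation of $f$ used only for checking.

```python
def set_parity(records, n):
    """Direct evaluation of f: 1 iff every key has even parity."""
    parity = [0] * n
    for i, b in records:
        parity[i] ^= b
    return 0 if any(parity) else 1

def equality_records(x, y):
    """Alice and Bob each build their records from their own string only."""
    s_a = [(i, bit) for i, bit in enumerate(x)]   # Alice's records S_A
    s_b = [(i, bit) for i, bit in enumerate(y)]   # Bob's records S_B
    return s_a, s_b

def strings_equal_via_set_parity(x, y):
    s_a, s_b = equality_records(x, y)
    return set_parity(s_a + s_b, len(x)) == 1
```

Key $i$ receives exactly the two bits $x_i$ and $y_i$, and their sum is even precisely when $x_i = y_i$.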

A.2 Proof of Theorem 4

We start by giving a deterministic polylog(n)-space streaming algorithm for SymmetricIndex, which implies
SymmetricIndex $\in$ pSS. The algorithm is given the elements of $S$ in an arbitrary order. If the first record is
$(a, i, x_i, p)$ for some $i$, the algorithm streams over the remaining records until it gets the record $(b, p, y_p, q)$ and
outputs $y_p$. If the first record is $(b, j, y_j, q)$ for some $j$, then the algorithm streams over the remaining records
until it gets the record $(a, q, x_q, p)$. In either case we output $x_q = y_p$.
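
The streaming algorithm just described can be sketched as follows. The function name `symmetric_index_stream` and the tuple encoding `(tag, index, bit, query)` of records are assumptions for illustration; the state kept between records is one tag and one index, that is, $O(\log n)$ bits.

```python
def symmetric_index_stream(stream):
    """One-pass algorithm for SymmetricIndex on records (tag, index, bit, query)."""
    it = iter(stream)
    tag, _, _, target = next(it)      # target is p if tag == "a", q if tag == "b"
    wanted = "b" if tag == "a" else "a"
    for t, idx, bit, _ in it:
        if t == wanted and idx == target:
            return bit                # y_p in the first case, x_q in the second
    raise ValueError("input is not a valid SymmetricIndex instance")
```

On an input that satisfies the promise, both cases return the same bit, so the output does not depend on the arrival order.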

We next show that SymmetricIndex $\notin$ pMUD. It suffices to show that any deterministic SCM protocol
for SymmetricIndex requires $\Omega(n)$ bits of communication. Consider such a protocol in which Alice and Bob
each send $b$ bits to Carol, and assume for the sake of contradiction that $b < n/40$. Let $I$ be the set of instances of
the SymmetricIndex problem; simple counting yields $|I| = n^2 2^{2n-1}$. For an instance $\phi \in I$, we split
it into two pieces, $\phi_A$ for Alice and $\phi_B$ for Bob. We assume that these pieces are

$$\phi_A = (a, 1, x_1^{\phi}, p^{\phi}), \ldots, (a, n, x_n^{\phi}, p^{\phi}) \quad \text{and} \quad \phi_B = (b, 1, y_1^{\phi}, q^{\phi}), \ldots, (b, n, y_n^{\phi}, q^{\phi}).$$
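
The count $|I| = n^2 2^{2n-1}$ can be confirmed by brute force for small $n$: there are $n^2$ choices of $(p, q)$ and $2^{2n}$ choices of $(x, y)$, of which exactly half satisfy $x_q = y_p$. The sketch below is only a sanity check; the function name `count_instances` is hypothetical and indices are 0-based.

```python
from itertools import product

def count_instances(n):
    """Count tuples (x, y, p, q) with x, y in {0,1}^n, p, q in range(n), x[q] == y[p]."""
    count = 0
    for p, q in product(range(n), repeat=2):
        for x in product((0, 1), repeat=n):
            for y in product((0, 1), repeat=n):
                if x[q] == y[p]:      # the promise x_q = y_p
                    count += 1
    return count
```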

For this partition of the input, let $I_A$ and $I_B$ be the sets of possible inputs of Alice and Bob, respectively. Alice
computes a function $h_A : I_A \to [2^b]$, Bob computes a function $h_B : I_B \to [2^b]$, and each sends the result to
Carol. Intuitively, we want to argue that if Alice sends at most $n/40$ bits to Carol, then for an input that is chosen
uniformly at random from $I$, Carol does not learn the value of $x_i$ for at least some large fraction of the indices $i$.
We formalize the above intuition with the following lemma:

Lemma 3. If we pick $\phi \in I$ and $i \in [n]$ uniformly at random and independently, then:

• With probability at least 4/5, there exists $\chi \in I$ with $\chi \neq \phi$ such that $h_A(\phi_A) = h_A(\chi_A)$, $p^{\phi} = p^{\chi}$, and $x_i^{\phi} \neq x_i^{\chi}$.

• With probability at least 4/5, there exists $\psi \in I$ with $\psi \neq \phi$ such that $h_B(\phi_B) = h_B(\psi_B)$, $q^{\phi} = q^{\psi}$, and $y_i^{\phi} \neq y_i^{\psi}$.

Proof. Because of the symmetry between the cases for Alice and Bob, it suffices to prove the assertion for Alice.

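For $j \in [2^b]$, $r \in [n]$, let

$$C_{j,r} = \{ \gamma \in I \mid h_A(\gamma_A) = j \text{ and } p^{\gamma} = r \}.$$

Let $a_{j,r}$ be the set of indices $i \in [n]$ such that $x_i^{\gamma}$ is fixed for all $\gamma \in C_{j,r}$. That is,

$$a_{j,r} = \{ i \in [n] \mid \text{for all } \gamma, \gamma' \in C_{j,r},\ x_i^{\gamma} = x_i^{\gamma'} \}.$$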

Observe that if we fix $|a_{j,r}|$ elements $x_i$ in all the instances in $C_{j,r}$, then any pair $\gamma, \gamma' \in C_{j,r}$ can differ only
in some $x_i$ with $i \notin a_{j,r}$, or in the index $q$, or in some $y_i$, with the constraint that $x_q = y_p$. Thus, for each $j \in [2^b]$ and $r \in [n]$,

$$|C_{j,r}| \le n \cdot 2^{2n - |a_{j,r}| - 1}. \qquad (2)$$

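Thus, if $|a_{j,r}| \ge n/20$, then $|C_{j,r}| \le n 2^{39n/20 - 1}$. Pick $\phi \in I$ and $i \in [n]$ uniformly at random and
independently, and let $\mathcal{E}$ be the event that there exists $\chi \in I$ with $\chi \neq \phi$ such that $h_A(\phi_A) = h_A(\chi_A)$, $p^{\phi} = p^{\chi}$, and $x_i^{\phi} \neq x_i^{\chi}$.
Then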



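$$\begin{aligned}
\Pr[\mathcal{E}] &= 1 - \frac{\sum_{j \in [2^b], r \in [n]} |C_{j,r}| \cdot |a_{j,r}|}{n \cdot |I|} \\
&\ge 1 - \frac{\sum_{j \in [2^b], r \in [n]} n 2^{39n/20 - 1} \cdot n + \sum_{j \in [2^b], r \in [n]} |C_{j,r}| \cdot n/20}{n \cdot |I|} \\
&\ge 1 - \frac{2^{n/40} \cdot n^3 \cdot 2^{39n/20 - 1} + n^2 \cdot 2^{2n-1} \cdot n/20}{n \cdot n^2 \cdot 2^{2n-1}} \\
&\ge 4/5,
\end{aligned}$$

for sufficiently large $n$. □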
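
To see how large $n$ must be for the final inequality, the two terms of the last fraction can be simplified separately. The following worked step is a sketch that assumes the bounds $|C_{j,r}| \le n 2^{39n/20 - 1}$ and $|I| = n^2 2^{2n-1}$ stated earlier in the proof.

```latex
\frac{2^{n/40} \cdot n^3 \cdot 2^{39n/20 - 1}}{n^3 \cdot 2^{2n-1}}
  = 2^{\,n/40 + 39n/20 - 2n} = 2^{-n/40},
\qquad
\frac{n^2 \cdot 2^{2n-1} \cdot n/20}{n^3 \cdot 2^{2n-1}} = \frac{1}{20}.
```

Hence $\Pr[\mathcal{E}] \ge 1 - 2^{-n/40} - 1/20$, which is at least $4/5$ as soon as $2^{-n/40} \le 3/20$, that is, for $n \ge 110$.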

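Consider an instance $\phi$ chosen uniformly at random from $I$. Clearly, $p^{\phi}$ and $q^{\phi}$ are distributed uniformly in
$[n]$, $q^{\phi}$ and $\phi_A$ are independent, and $p^{\phi}$ and $\phi_B$ are independent. Thus, by Lemma 3, with probability at least
$1 - 2 \cdot \frac{1}{5}$ there exist $\chi, \psi \in I$ such that:

• $h_A(\phi_A) = h_A(\chi_A)$, $p^{\phi} = p^{\chi}$, and $x^{\phi}_{q^{\phi}} \neq x^{\chi}_{q^{\phi}}$.

• $h_B(\phi_B) = h_B(\psi_B)$, $q^{\phi} = q^{\psi}$, and $y^{\phi}_{p^{\phi}} \neq y^{\psi}_{p^{\phi}}$.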



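Consider now the instance $\gamma = \chi_A \cup \psi_B$. That is,

$$\gamma = (a, 1, x_1^{\chi}, p^{\chi}), \ldots, (a, n, x_n^{\chi}, p^{\chi}), (b, 1, y_1^{\psi}, q^{\psi}), \ldots, (b, n, y_n^{\psi}, q^{\psi}).$$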

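Observe that

$$\begin{aligned}
x^{\gamma}_{q^{\gamma}} &= x^{\chi}_{q^{\phi}} && \text{(by the definition of } \gamma\text{)} \\
&= 1 - x^{\phi}_{q^{\phi}} \\
&= 1 - y^{\phi}_{p^{\phi}} && \text{(by the promise for } \phi\text{)} \\
&= y^{\psi}_{p^{\phi}} \\
&= y^{\psi}_{p^{\chi}} \\
&= y^{\gamma}_{p^{\gamma}} && \text{(by the definition of } \gamma\text{).}
\end{aligned}$$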






Thus, $\gamma$ satisfies the promise of the problem (i.e., $\gamma \in I$). Moreover, we have $h_C(h_A(\phi_A), h_B(\phi_B)) =
h_C(h_A(\gamma_A), h_B(\gamma_B))$, while $x^{\phi}_{q^{\phi}} \neq x^{\gamma}_{q^{\gamma}}$. It follows that the protocol is not correct. We have thus shown that
pMUD $\subsetneq$ pSS and proved Theorem 4.